129 research outputs found

    Data Leakage via Access Patterns of Sparse Features in Deep Learning-based Recommendation Systems

    Online personalized recommendation services are generally hosted in the cloud, where users query a cloud-based model to receive recommended content such as merchandise of interest or news feeds. State-of-the-art recommendation models rely on sparse and dense features to represent users' profile information and the items they interact with. Although sparse features account for 99% of the total model size, little attention has been paid to the potential information leakage through them. Sparse features are used to track users' behavior, e.g., their click history and object interactions, and thus potentially carry each user's private information. Sparse features are represented as learned embedding vectors stored in large tables, and personalized recommendation is performed by using a specific user's sparse features to index into these tables. Even with recently proposed methods that hide the computation happening in the cloud, an attacker in the cloud may still be able to track the access patterns to the embedding tables. This paper explores the private information that may be learned by tracking a recommendation model's sparse feature access patterns. We first characterize the types of attacks that can be carried out on sparse features in recommendation models in an untrusted cloud, followed by a demonstration of how each of these attacks leads to extracting users' private information or tracking users by their behavior over time.
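    A minimal sketch of the lookup pattern described above, assuming a single embedding table and hypothetical names (`embedding_table`, `lookup`, `access_log`): the rows gathered during inference are indexed directly by the user's sparse feature IDs, so an observer who merely records the accessed row indices recovers the interaction history even if the computation itself is hidden.

```python
import numpy as np

# Hypothetical sketch: embedding-table lookup in a recommendation model.
# Rows are indexed directly by a user's sparse feature IDs (e.g., IDs of
# items the user clicked), so the sequence of row indices touched during
# inference mirrors the user's interaction history.

VOCAB_SIZE, EMB_DIM = 100_000, 64
embedding_table = np.random.randn(VOCAB_SIZE, EMB_DIM).astype(np.float32)

access_log = []  # what a cloud-side observer could record

def lookup(sparse_ids):
    """Gather embedding rows for one user's sparse feature IDs."""
    access_log.extend(sparse_ids)          # observable access pattern
    return embedding_table[sparse_ids]     # pooled later by the model

user_click_history = [42, 977, 42, 31337]  # private input
_ = lookup(user_click_history)

# Even if the arithmetic is hidden (e.g., run inside an enclave), the logged
# row indices alone reveal which items the user interacted with.
print(access_log)  # [42, 977, 42, 31337]
```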

    Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

    Mixture-of-Experts (MoE) models have gained popularity for achieving state-of-the-art performance on a wide range of tasks in computer vision and natural language processing. They effectively expand model capacity while incurring only a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large size and complex communication patterns. In this work, we characterize two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiency at deployment. We propose three optimization techniques to mitigate these inefficiencies: (1) dynamic gating, (2) expert buffering, and (3) expert load balancing. We show that dynamic gating improves maximum throughput by 6.21-11.23× for LM, 5.75-10.98× for the MT encoder, and 2.58-5.71× for the MT decoder. It also reduces memory usage by up to 1.36× for LM and up to 1.1× for MT. We further propose Expert Buffering, a new caching mechanism that keeps only hot, active experts in GPU memory while buffering the rest in CPU memory. This reduces static memory allocation by up to 1.47×. Finally, we propose a load balancing methodology that provides additional scalability to the workload.
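    The expert-buffering idea lends itself to a small cache sketch. The following is only an illustrative LRU-style buffer under assumed names (`ExpertBuffer`, `weights_i`), not the paper's implementation: a bounded set of hot experts stays resident in fast (GPU) memory, and a miss pages an expert in from host memory while evicting the least recently used one.

```python
from collections import OrderedDict

class ExpertBuffer:
    """Toy LRU buffer: keep at most `capacity` experts in fast (GPU) memory;
    the remaining expert weights stay in slow (CPU/host) memory and are
    copied in on demand, evicting the least recently used expert."""

    def __init__(self, all_experts, capacity):
        self.cpu_experts = dict(all_experts)   # expert_id -> weights (host copy)
        self.gpu_experts = OrderedDict()       # resident set, LRU order
        self.capacity = capacity

    def get(self, expert_id):
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)   # mark as recently used
        else:
            if len(self.gpu_experts) >= self.capacity:
                self.gpu_experts.popitem(last=False)  # evict coldest expert
            # stand-in for a host-to-device copy of the expert's weights
            self.gpu_experts[expert_id] = self.cpu_experts[expert_id]
        return self.gpu_experts[expert_id]

# Usage: 8 experts, but only 2 resident at a time.
buf = ExpertBuffer({i: f"weights_{i}" for i in range(8)}, capacity=2)
for eid in [0, 1, 0, 3, 1]:       # routing decisions from the gating network
    buf.get(eid)
print(list(buf.gpu_experts))      # [3, 1] -> hot experts currently resident
```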

    Thermal-aware 3D Microarchitectural Floorplanning

    Next-generation deep submicron processor designs will need to take into consideration many performance-limiting factors. Flip-flops are inserted to prevent global wire delay from becoming nonlinear, enabling deeper pipelines and higher clock frequencies. The move to 3D ICs will likely further shorten wirelength, but it will also cause thermal issues to become a major bottleneck to performance improvement. In this paper we propose a floorplanning algorithm that takes into consideration both thermal issues and profile-weighted wirelength using mathematical programming. Our profile-driven objective improves performance by 20% over a wirelength-driven one, while the thermal-driven objective improves temperature by 24% on average over the profile-driven case.
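    As a rough illustration of the profile-weighted wirelength component of such an objective (the paper formulates the full thermal-aware problem as a mathematical program; the function and net names below are hypothetical), each net's half-perimeter wirelength can be weighted by its profiled activity so that heavily exercised nets are kept short; a thermal-driven variant would add a temperature penalty term.

```python
# Hypothetical sketch of a profile-weighted wirelength cost, in the spirit of
# the floorplanner's objective; it only evaluates a candidate floorplan.

def hpwl(pins):
    """Half-perimeter wirelength of one net, given (x, y) pin positions."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def weighted_wirelength(nets, activity):
    """Sum of per-net HPWL, weighted by profiled switching activity,
    so frequently exercised (performance-critical) nets stay short."""
    return sum(activity[name] * hpwl(pins) for name, pins in nets.items())

# Toy floorplan: block centers acting as pins of two nets.
nets = {
    "fetch_to_decode": [(0.0, 0.0), (1.0, 0.5)],
    "alu_to_lsq":      [(1.0, 0.5), (3.0, 2.0)],
}
activity = {"fetch_to_decode": 0.9, "alu_to_lsq": 0.2}  # from profiling
print(weighted_wirelength(nets, activity))  # 0.9*1.5 + 0.2*3.5 = 2.05
```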

    GPU-based Private Information Retrieval for On-Device Machine Learning Inference

    On-device machine learning (ML) inference can enable the use of private user data on user devices without revealing it to remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables too large to be stored on-device. In particular, recommendation models typically use multiple embedding tables, each on the order of 1-10 GB, making them impractical to store on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information. As off-the-shelf PIR algorithms are usually too computationally intensive to use directly for latency-sensitive inference tasks, we 1) propose novel GPU-based acceleration of PIR, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than 20× over an optimized CPU PIR implementation, and our PIR-ML co-design provides an over 5× additional throughput improvement at fixed model quality. Together, for various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to 100,000 queries per second -- a >100× throughput improvement over a CPU-based baseline -- while maintaining model accuracy.
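    To make the goal of PIR concrete, here is a toy two-server XOR-based PIR sketch. This is not the single-server scheme the paper accelerates on GPUs; it only illustrates how a client can fetch one embedding row without either (non-colluding) server learning which row was requested. All names and sizes are illustrative.

```python
import numpy as np

# Hypothetical sketch: classic two-server XOR-based PIR for embedding rows.
# Neither server learns which row index the client wants, assuming the two
# servers hold identical table replicas and do not collude.

N_ROWS, EMB_DIM = 1024, 16
rng = np.random.default_rng(0)
table = rng.integers(0, 2**16, size=(N_ROWS, EMB_DIM), dtype=np.uint64)

def client_queries(target_row):
    """Split the desired index into two random-looking subset queries."""
    q1 = rng.integers(0, 2, size=N_ROWS, dtype=np.uint8)  # random subset
    q2 = q1.copy()
    q2[target_row] ^= 1        # subsets differ only at the secret index
    return q1, q2

def server_answer(query):
    """XOR together every row selected by the query bit-vector."""
    ans = np.zeros(EMB_DIM, dtype=np.uint64)
    for i in np.flatnonzero(query):
        ans ^= table[i]
    return ans

q1, q2 = client_queries(target_row=42)
recovered = server_answer(q1) ^ server_answer(q2)  # rows cancel except row 42
assert np.array_equal(recovered, table[42])
```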

    DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

    Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant fraction of the cloud infrastructure's compute demand. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm-and-system co-design methodology to custom-design systems for recommendation use cases. Leveraging insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account the characteristics of inference query sizes and arrival patterns, recommendation model architectures, and the underlying hardware systems. By doing so, system throughput is doubled across eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
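    A sketch of the latency-bounded throughput idea, with entirely hypothetical names and a toy latency model rather than DeepRecSched itself: queries are routed to CPU or accelerator by a size threshold, and the threshold is tuned so that simulated tail latency stays within the SLA.

```python
# Hypothetical sketch of the scheduling idea: small queries run on CPU cores
# (low latency), large queries are offloaded to an accelerator (better
# scaling), and the size threshold is chosen against a tail-latency target.

def route(query_size, threshold):
    return "accelerator" if query_size > threshold else "cpu"

def p99(latencies):
    s = sorted(latencies)
    return s[int(0.99 * (len(s) - 1))]

def tune_threshold(query_sizes, latency_model, sla_ms, candidates):
    """Pick the largest threshold (most CPU work, least offload overhead)
    whose simulated 99th-percentile latency still meets the SLA."""
    best = None
    for t in candidates:
        lat = [latency_model(q, route(q, t)) for q in query_sizes]
        if p99(lat) <= sla_ms:
            best = t
    return best

# Toy latency model (an assumption): CPU latency grows with query size, the
# accelerator pays a fixed offload cost but scales better for large batches.
latency_model = lambda q, dev: 0.5 * q if dev == "cpu" else 20 + 0.05 * q
sizes = list(range(1, 401, 4))           # stand-in for production query sizes
print(tune_threshold(sizes, latency_model, sla_ms=100, candidates=range(10, 400, 10)))
```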

    Adaptive transaction scheduling for transactional memory systems

    Transactional memory systems are expected to enable parallel programming with lower programming complexity while delivering improved performance over traditional lock-based systems. Nonetheless, there are certain situations where transactional memory systems can actually perform worse. Transactional memory systems can outperform locks only when the executing workloads contain sufficient parallelism. When the workload lacks inherent parallelism, launching excessive transactions can degrade performance. These situations will become dominant in future workloads as large-scale transactions are frequently executed. In this thesis, we propose a new paradigm called adaptive transaction scheduling to address this issue. Based on the parallelism feedback from applications, our adaptive transaction scheduler dynamically dispatches and controls the number of concurrently executing transactions. In our case study, we show that our low-cost mechanism not only guarantees that hardware transactional memory systems perform no worse than a single global lock, but also significantly improves performance for both hardware and software transactional memory systems.
    M.S. thesis. Committee Chair: Lee, Hsien-Hsin; Committee Member: Blough, Douglas; Committee Member: Yalamanchili, Sudhakar
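    A minimal software sketch of the feedback idea, using hypothetical names rather than the thesis's mechanism: transactions acquire a slot before executing, and the concurrency cap shrinks on aborts (contention) and grows back on commits; with the cap driven down to one, the scheduler degenerates to a single global lock.

```python
import threading

class AdaptiveTxScheduler:
    """Toy adaptive transaction scheduler: transactions acquire a slot before
    running, and the number of slots shrinks when aborts indicate low
    effective parallelism, then grows back as transactions commit."""

    def __init__(self, max_concurrency=8):
        self.max_concurrency = max_concurrency
        self.limit = max_concurrency          # current concurrency cap
        self.active = 0
        self.cv = threading.Condition()

    def begin(self):
        with self.cv:
            while self.active >= self.limit:
                self.cv.wait()                # queue instead of launching
            self.active += 1

    def end(self, committed):
        with self.cv:
            self.active -= 1
            if committed:                     # parallelism feedback
                self.limit = min(self.max_concurrency, self.limit + 1)
            else:                             # abort -> contention, back off
                self.limit = max(1, self.limit // 2)
            self.cv.notify_all()

# A worker thread would wrap each transaction attempt as:
#   sched.begin(); ok = try_transaction(); sched.end(committed=ok)
sched = AdaptiveTxScheduler(max_concurrency=4)
sched.begin(); sched.end(committed=False)    # an abort halves the cap
print(sched.limit)                           # 2
```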

    Kernel Formula Approach to the Universal Whitham Hierarchy

    We derive the dispersionless Hirota equations of the universal Whitham hierarchy from the kernel formula approach proposed by Carroll and Kodama. In addition, we verify the associativity equations in this hierarchy from the dispersionless Hirota equations and give a realization of the associative algebra with structure constants expressed in terms of the residue formulas.
    Comment: 18 pages